Self-supervised image denoising techniques have emerged as convenient methods that allow training denoising models without requiring ground-truth noise-free data. Existing methods usually optimize loss metrics that are calculated from multiple noisy realizations of similar images, e.g., from neighboring tomographic slices. However, those approaches fail to utilize the multiple contrasts that are routinely acquired in medical imaging modalities like MRI or dual-energy CT. In this work, we propose Noise2Contrast, a new self-supervised training scheme that combines information from multiple measured image contrasts to train a denoising model. We stack the denoising operator with domain-transfer operators to exploit the independent noise realizations of the different image contrasts and derive a self-supervised loss. The trained denoising operator achieves convincing quantitative and qualitative results, outperforming state-of-the-art self-supervised methods by 4.7-11.0%/4.8-7.3% (PSNR/SSIM) on brain MRI data and by 43.6-50.5%/57.1-77.1% (PSNR/SSIM) on dual-energy CT X-ray microscopy data with respect to the noisy baseline. Our experiments on different real measured data sets indicate that Noise2Contrast training generalizes to other multi-contrast imaging modalities.
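To make the training scheme concrete, the following is a minimal sketch of a Noise2Contrast-style training step, assuming PyTorch; the simple convolutional networks, the MSE loss, and the function names are illustrative stand-ins rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the denoising and domain-transfer operators.
denoiser = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 1, 3, padding=1))
domain_transfer = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                nn.Conv2d(32, 1, 3, padding=1))

optimizer = torch.optim.Adam(list(denoiser.parameters()) +
                             list(domain_transfer.parameters()), lr=1e-4)
mse = nn.MSELoss()

def training_step(noisy_a, noisy_b):
    """noisy_a, noisy_b: co-registered noisy images of two contrasts, shape (N, 1, H, W)."""
    optimizer.zero_grad()
    denoised_a = denoiser(noisy_a)             # denoise contrast A
    predicted_b = domain_transfer(denoised_a)  # map the denoised image to contrast B
    # The independent noise realization of contrast B serves as the self-supervised
    # target, so the loss cannot be minimized by reproducing the noise of contrast A.
    loss = mse(predicted_b, noisy_b)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in data:
training_step(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```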
Incorporating computed tomography (CT) reconstruction operators into differentiable pipelines has proven beneficial in many applications. Such approaches usually focus on the projection data and keep the acquisition geometry fixed. However, precise knowledge of the acquisition geometry is essential for high-quality reconstruction results. In this paper, the differentiable formulation of fan-beam CT reconstruction is extended to the acquisition geometry. This allows gradient information to be propagated from a loss function on the reconstructed image into the geometry parameters. As a proof-of-concept experiment, this idea is applied to rigid motion compensation. The cost function is parameterized by a trained neural network which regresses an image quality metric from the motion-affected reconstruction alone. Using the proposed method, we are the first to optimize such an autofocus-inspired algorithm based on analytical gradients. The algorithm achieves a reduction in MSE by 35.5% and an improvement in SSIM by 12.6% over the motion-affected reconstruction. Beyond motion compensation, we see further use cases of our differentiable method in scanner calibration or hybrid techniques employing deep models.
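A schematic sketch of such an autofocus-style optimization loop is given below, assuming PyTorch; `differentiable_reco` and `quality_net` are toy placeholders standing in for the differentiable fan-beam reconstruction operator and the trained image-quality regressor, so only the gradient flow into the geometry parameters is demonstrated.

```python
import torch
import torch.nn as nn

num_views = 360
# Per-view rigid motion parameters (tx, ty, rotation) -- the quantities being optimized.
motion_params = torch.zeros(num_views, 3, requires_grad=True)

def differentiable_reco(projections, motion):
    # Placeholder for the differentiable fan-beam reconstruction that applies the
    # per-view rigid transforms; here only a toy differentiable mapping so that
    # gradients can flow from the image back to the geometry parameters.
    return projections.mean(dim=0, keepdim=True) + motion.mean()

# Stand-in for the trained network that regresses an image quality metric.
quality_net = nn.Sequential(nn.Flatten(), nn.Linear(256 * 256, 1))

projections = torch.randn(num_views, 256, 256)
optimizer = torch.optim.Adam([motion_params], lr=1e-3)

for step in range(10):
    optimizer.zero_grad()
    reco = differentiable_reco(projections, motion_params)  # image depends on the geometry
    loss = quality_net(reco).mean()                         # predicted quality metric as cost
    loss.backward()                                         # analytical gradients w.r.t. geometry
    optimizer.step()
```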
Low-dose computed tomography (CT) denoising algorithms aim to enable reduced patient dose in routine CT acquisitions while maintaining high image quality. Recently, deep learning (DL)-based methods were introduced, outperforming conventional denoising algorithms on this task due to their high model capacity. However, for the transition of DL-based denoising to clinical practice, these data-driven approaches must generalize robustly beyond the seen training data. We therefore propose a hybrid denoising approach consisting of a set of trainable joint bilateral filters (JBFs) combined with a convolutional DL-based denoising network that predicts the guidance image. Our denoising pipeline combines the high model capacity enabled by DL-based feature extraction with the reliability of the conventional JBF. The pipeline's ability to generalize is demonstrated by training on abdomen CT scans without metal implants and testing on abdomen scans with metal implants as well as on head CT data. When embedding two DL-based denoisers (RED-CNN/QAE) in our pipeline, the denoising performance improves by 10%/82% (RMSE) and 3%/81% (PSNR) in regions containing metal, and by 6%/78% (RMSE) and 2%/4% (PSNR) on head CT data, compared to the respective vanilla model. Concluding, the proposed trainable JBFs limit the error bound of the deep neural networks to facilitate the applicability of DL-based denoisers in low-dose CT pipelines.
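The pipeline structure can be sketched as follows, assuming PyTorch; the small guidance CNN, the brute-force joint bilateral filter, and the fixed filter parameters are illustrative assumptions (in the paper the filter parameters are trainable and the guidance network is, e.g., RED-CNN or QAE).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_bilateral_filter(x, guidance, kernel_size=5, sigma_spatial=2.0, sigma_range=0.1):
    """Brute-force JBF: spatial Gaussian weights, range weights from the guidance image."""
    pad = kernel_size // 2
    n, c, h, w = x.shape
    # Neighborhoods of the input and of the guidance image: (N, C, k*k, H, W)
    x_nb = F.unfold(F.pad(x, [pad] * 4, mode='reflect'), kernel_size).reshape(n, c, -1, h, w)
    g_nb = F.unfold(F.pad(guidance, [pad] * 4, mode='reflect'), kernel_size).reshape(n, c, -1, h, w)
    # Spatial Gaussian weights over the kernel window
    coords = torch.arange(kernel_size) - pad
    yy, xx = torch.meshgrid(coords, coords, indexing='ij')
    w_spatial = torch.exp(-(xx**2 + yy**2) / (2 * sigma_spatial**2)).reshape(1, 1, -1, 1, 1)
    # Range weights computed on the guidance image (this is what makes the filter "joint")
    w_range = torch.exp(-(g_nb - guidance.unsqueeze(2))**2 / (2 * sigma_range**2))
    weights = w_spatial * w_range
    return (weights * x_nb).sum(dim=2) / weights.sum(dim=2)

# Illustrative guidance network; in the hybrid pipeline this is the DL-based denoiser.
guidance_cnn = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 1, 3, padding=1))

noisy = torch.randn(1, 1, 64, 64)
guidance = guidance_cnn(noisy)                  # DL-based guidance image prediction
denoised = joint_bilateral_filter(noisy, guidance)
```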
Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain. In the context of UDA, contrastive learning (CL) can help to better separate classes in feature space. However, in image segmentation its use is hindered by the large memory footprint caused by computing the pixel-wise contrastive loss. Moreover, labeled target data is not easily available in medical imaging, and obtaining new samples is not economical. As a result, in this work we tackle the more challenging UDA setting in which only a few (few-shot) or a single (one-shot) image from the target domain is available. We apply a style-transfer module to mitigate the scarcity of target samples. Then, to align source and target features and to address the memory issue of the conventional contrastive loss, we propose centroid-based contrastive learning (CCL) and a centroid norm regularizer (CNR) to optimize the contrastive pairs in both direction and magnitude. In addition, we propose multi-partition centroid contrastive learning (MPCCL) to further reduce the discrepancy of the target features. Few-shot evaluation on the MS-CMRSeg dataset shows that, compared to the baseline, the proposed method improves the segmentation performance on the target domain by 0.34 in Dice score, and by 0.31 in Dice score in the stricter one-shot setting.
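An illustrative sketch of a centroid-based contrastive loss with a centroid norm regularizer is shown below, assuming PyTorch; the exact pairing scheme, temperature, and regularizer weighting of the paper may differ, and in the few-/one-shot setting the target labels would typically be pseudo-labels.

```python
import torch
import torch.nn.functional as F

def class_centroids(features, labels, num_classes):
    """Average the pixel features of each class into one centroid per class.
    features: (N, D), labels: (N,) with values in [0, num_classes)."""
    centroids = torch.zeros(num_classes, features.shape[1],
                            device=features.device, dtype=features.dtype)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            centroids[c] = features[mask].mean(dim=0)
    return centroids

def centroid_contrastive_loss(src_feats, src_labels, tgt_feats, tgt_labels,
                              num_classes=4, temperature=0.1):
    src_c = F.normalize(class_centroids(src_feats, src_labels, num_classes), dim=1)
    tgt_c = F.normalize(class_centroids(tgt_feats, tgt_labels, num_classes), dim=1)
    logits = src_c @ tgt_c.t() / temperature   # similarity of all centroid pairs (direction)
    targets = torch.arange(num_classes)        # same-class centroids are the positives
    return F.cross_entropy(logits, targets)

def centroid_norm_regularizer(src_feats, src_labels, tgt_feats, tgt_labels, num_classes=4):
    # Encourage matching centroid magnitudes across domains.
    src_c = class_centroids(src_feats, src_labels, num_classes)
    tgt_c = class_centroids(tgt_feats, tgt_labels, num_classes)
    return (src_c.norm(dim=1) - tgt_c.norm(dim=1)).abs().mean()

# Example with random stand-in features and (pseudo-)labels:
src_f, tgt_f = torch.randn(1000, 64), torch.randn(1000, 64)
src_y, tgt_y = torch.randint(0, 4, (1000,)), torch.randint(0, 4, (1000,))
loss = (centroid_contrastive_loss(src_f, src_y, tgt_f, tgt_y)
        + 0.1 * centroid_norm_regularizer(src_f, src_y, tgt_f, tgt_y))
```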
Speech-driven 3D facial animation has been widely explored, with applications in gaming, character animation, virtual reality, and telepresence systems. State-of-the-art methods deform the face topology of the target actor to sync with the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor, thus resulting in unrealistic and inaccurate lip movements. To address this, we present Imitator, a speech-driven facial expression synthesis method which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset, which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for an identity-specific speaking style based on a short reference video. To train the prior, we introduce a novel loss function based on detected bilabial consonants to ensure plausible lip closures and consequently improve the realism of the generated expressions. Through detailed experiments and a user study, we show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a "scanner" of the 3D world. Intuitively, human movement indicates the free space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying, or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), a generative model of indoor scenes that produces furniture layouts consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.
We propose ClipFace, a novel self-supervised approach for text-guided editing of textured 3D morphable models of faces. Specifically, we employ user-friendly language prompts to enable control of the expressions as well as the appearance of 3D faces. We leverage the geometric expressiveness of 3D morphable models, which inherently possess limited controllability and texture expressivity, and develop a self-supervised generative model to jointly synthesize expressive, textured, and articulated faces in 3D. We enable high-quality texture generation for 3D faces by adversarial self-supervised training, guided by differentiable rendering against collections of real RGB images. Controllable editing and manipulation are given by language prompts to adapt the texture and expression of the 3D morphable model. To this end, we propose a neural network that predicts both texture and expression latent codes of the morphable model. Our model is trained in a self-supervised fashion by exploiting differentiable rendering and losses based on a pre-trained CLIP model. Once trained, our model jointly predicts face textures in UV space, along with expression parameters to capture both geometry and texture changes in facial expressions, in a single forward pass. We further show the applicability of our method to generate temporally changing textures for a given animation sequence.
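A minimal sketch of CLIP-based guidance through a differentiable renderer is shown below, assuming the OpenAI `clip` package; the rendered images are assumed to come from the differentiable renderer, CLIP's input normalization is omitted for brevity, and the full ClipFace training additionally uses adversarial and other loss terms.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def clip_guidance_loss(rendered_rgb, prompt):
    """rendered_rgb: (N, 3, 224, 224) differentiably rendered faces in [0, 1]."""
    text_tokens = clip.tokenize([prompt]).to(device)
    image_emb = clip_model.encode_image(rendered_rgb)   # gradients flow back into the renderer
    text_emb = clip_model.encode_text(text_tokens)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return 1.0 - (image_emb * text_emb).sum(dim=-1).mean()  # cosine distance to the prompt

# Hypothetical usage: `rendered` would come from the differentiable renderer driven by the
# predicted texture and expression latent codes.
# loss = clip_guidance_loss(rendered, "a smiling face with red lipstick")
```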
We propose a novel method for high-quality facial texture reconstruction from RGB images using a novel capturing routine based on a single smartphone which we equip with an inexpensive polarization foil. Specifically, we turn the flashlight into a polarized light source and add a polarization filter on top of the camera. Leveraging this setup, we capture the face of a subject with cross-polarized and parallel-polarized light. For each subject, we record two short sequences in a dark environment under flash illumination with different light polarization using the modified smartphone. Based on these observations, we reconstruct an explicit surface mesh of the face using structure from motion. We then exploit the camera and light co-location within a differentiable renderer to optimize the facial textures using an analysis-by-synthesis approach. Our method optimizes for high-resolution normal textures, diffuse albedo, and specular albedo using a coarse-to-fine optimization scheme. We show that the optimized textures can be used in a standard rendering pipeline to synthesize high-quality photo-realistic 3D digital humans in novel environments.
We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison to the previous state of the art, DINER achieves higher synthesis quality and can process input views with greater disparity. This allows us to capture scenes more completely without changing capturing hardware requirements and ultimately enables larger viewpoint changes during novel view synthesis. We evaluate our method by synthesizing novel views, both for human heads and for general objects, and observe significantly improved qualitative results and increased perceptual metrics compared to the previous state of the art. The code will be made publicly available for research purposes.
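One plausible realization of depth-guided ray sampling is sketched below, assuming PyTorch; concentrating half of the samples in a Gaussian band around the predicted depth is an assumption for illustration, not necessarily DINER's exact sampling scheme.

```python
import torch

def depth_guided_samples(predicted_depth, num_samples=32, near=0.5, far=3.0, sigma=0.05):
    """predicted_depth: (R,) per-ray depth estimate from the predicted depth maps.
    Returns (R, num_samples) sample depths: half spread uniformly over [near, far],
    half concentrated in a narrow band around the depth prior."""
    num_rays = predicted_depth.shape[0]
    uniform = torch.linspace(near, far, num_samples // 2).expand(num_rays, -1)
    focused = (predicted_depth.unsqueeze(1)
               + sigma * torch.randn(num_rays, num_samples - num_samples // 2))
    return torch.sort(torch.cat([uniform, focused], dim=1), dim=1).values

# Example: 1024 rays whose predicted depth is 1.5 units from the camera.
samples = depth_guided_samples(torch.full((1024,), 1.5))
```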
Interactive machine learning (IML) should enable intelligent systems to learn interactively from their end users and is rapidly gaining importance. Although it puts the human in the loop, the interaction is mostly performed via mutual explanations that miss contextual information. Furthermore, current IML strategies such as CAIPI are limited to "destructive" feedback, meaning that they solely allow an expert to prevent a learner from using irrelevant features. In this work, we propose a novel interaction framework called semantic interactive learning for the text domain. We frame the problem of incorporating constructive and contextual feedback into the learner as the task of finding an architecture that (a) enables more semantic alignment between human and machine and (b) at the same time helps to maintain the statistical characteristics of the input domain when generating user-defined counterexamples based on meaningful corrections. To this end, we introduce a technique called SemanticPush that is effective in translating conceptual corrections of humans into non-extrapolating training examples, such that the learner's reasoning is pushed towards the desired behavior. In several experiments, we show that our approach clearly outperforms CAIPI, a state-of-the-art IML strategy, in terms of predictive performance as well as local explanation quality in downstream multi-class classification tasks.